DATA SCIENCE SESSIONS VOL. 3¶

A Foundational Python Data Science Course¶

Session 17: Generalized Linear Models II. Binomial Logistic Regression and ROC analysis. Regularization of BLR.¶

← Back to course webpage

Feedback should be sent to goran.milovanovic@datakolektiv.com.

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

Lecturers¶

Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner

Aleksandar Cvetković, PhD, DataKolektiv, Consultant

Ilija Lazarević, MA, DataKolektiv, Consultant


What do we want to do today?

In this session, we continue to develop the Binomial Logistic Regression model and introduce ROC (Receiver Operating Characteristic) analysis for classification problems. From ROC analysis, we will learn a lot about metrics for evaluating classification models: the True Positive, False Positive, True Negative, and False Negative rates; Precision and Recall; and the F1 score. We will learn how to plot the model's ROC curve and analyse it to determine the optimal decision threshold for any given Binomial Logistic Regression model.

We will optimize the Binomial Logistic model directly using an optimization algorithm from the SciPy library, which completes our theoretical study of this problem.

We then introduce additional model indicators that take model complexity into account, such as the Akaike Information Criterion (AIC). Finally, we regularize the Binomial Logistic Regression model and learn how to perform Binomial Logistic Regression in scikit-learn.

In [ ]:
### --- Setup - importing the libraries

# - suppress those annoying 'FutureWarning' messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# - data
import numpy as np
import pandas as pd

# - os
import os

# - ml
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

# - visualization
import matplotlib.pyplot as plt
import seaborn as sns

# - parameters
%matplotlib inline
pd.options.mode.chained_assignment = None  # default='warn'
sns.set_theme()
# - rng
rng = np.random.default_rng()
# - plots
plt.rc("figure", figsize=(8, 6))
plt.rc("font", size=14)
sns.set_theme(style='white')

# - directory tree
data_dir = os.path.join(os.getcwd(), '_data')

1. Binomial Logistic Regression, ROC analysis.¶

In [ ]:
# - loading the dataset
# - GitHub: https://github.com/dijendersaini/Customer-Churn-Model/blob/master/churn_data.csv
# - place it in your _data/ directory
churn_data = pd.read_csv(os.path.join(data_dir, 'churn_data.csv'))
churn_data.head(10)
Out[ ]:
customerID tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG 1 No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE 34 Yes One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK 2 Yes Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW 45 No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU 2 Yes Month-to-month Yes Electronic check 70.70 151.65 Yes
5 9305-CDSKC 8 Yes Month-to-month Yes Electronic check 99.65 820.5 Yes
6 1452-KIOVK 22 Yes Month-to-month Yes Credit card (automatic) 89.10 1949.4 No
7 6713-OKOMC 10 No Month-to-month No Mailed check 29.75 301.9 No
8 7892-POOKP 28 Yes Month-to-month Yes Electronic check 104.80 3046.05 Yes
9 6388-TABGU 62 Yes One year No Bank transfer (automatic) 56.15 3487.95 No
In [ ]:
churn_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   tenure            7043 non-null   int64  
 2   PhoneService      7043 non-null   object 
 3   Contract          7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
 8   Churn             7043 non-null   object 
dtypes: float64(1), int64(1), object(7)
memory usage: 495.3+ KB
In [ ]:
# - use .replace method to replace empty strings with NaN values
churn_data = churn_data.replace(r'^\s*$', np.nan, regex=True)
In [ ]:
# - we drop all the entries with missing values
churn_data = churn_data.dropna().reset_index(drop=True)
In [ ]:
# - notice that 'TotalCharges' values are non-numeric type, but they should be
# - this is due to the empty string values that were previously present
# - we convert them to numeric type
churn_data['TotalCharges'] = churn_data['TotalCharges'].astype('float')
churn_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7032 non-null   object 
 1   tenure            7032 non-null   int64  
 2   PhoneService      7032 non-null   object 
 3   Contract          7032 non-null   object 
 4   PaperlessBilling  7032 non-null   object 
 5   PaymentMethod     7032 non-null   object 
 6   MonthlyCharges    7032 non-null   float64
 7   TotalCharges      7032 non-null   float64
 8   Churn             7032 non-null   object 
dtypes: float64(2), int64(1), object(6)
memory usage: 494.6+ KB

Target: Predict churn from all numeric predictors¶

In [ ]:
### --- Preparing the model frame

# - extracting 'Churn' and all the numerical feature columns
# - .copy() so that we modify an independent frame, not a view of churn_data
model_frame = churn_data[['Churn', 'tenure', 'MonthlyCharges', 'TotalCharges']].copy()
model_frame.head()
Out[ ]:
Churn tenure MonthlyCharges TotalCharges
0 No 1 29.85 29.85
1 No 34 56.95 1889.50
2 Yes 2 53.85 108.15
3 No 45 42.30 1840.75
4 Yes 2 70.70 151.65
In [ ]:
# - encoding 'Churn' values to binary values
model_frame['Churn'] = model_frame['Churn'].apply(lambda x: int(x == 'Yes'))
model_frame.head()
Out[ ]:
Churn tenure MonthlyCharges TotalCharges
0 0 1 29.85 29.85
1 0 34 56.95 1889.50
2 1 2 53.85 108.15
3 0 45 42.30 1840.75
4 1 2 70.70 151.65
In [ ]:
predictors = model_frame.columns.drop('Churn')
predictors
Out[ ]:
Index(['tenure', 'MonthlyCharges', 'TotalCharges'], dtype='object')
In [ ]:
# --- Composing the formula of the model

# - right side of the formula
formula = ' + '.join(predictors)

# - left side of the formula
formula = 'Churn ~ ' + formula

formula
Out[ ]:
'Churn ~ tenure + MonthlyCharges + TotalCharges'
In [ ]:
# - fitting BLR model to the data
binomial_linear_model = smf.logit(formula=formula, data=model_frame).fit()
Optimization terminated successfully.
         Current function value: 0.453372
         Iterations 7
In [ ]:
binomial_linear_model.summary()
Out[ ]:
Logit Regression Results
Dep. Variable: Churn No. Observations: 7032
Model: Logit Df Residuals: 7028
Method: MLE Df Model: 3
Date: Sat, 06 May 2023 Pseudo R-squ.: 0.2170
Time: 00:09:33 Log-Likelihood: -3188.1
converged: True LL-Null: -4071.7
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
Intercept -1.5988 0.117 -13.628 0.000 -1.829 -1.369
tenure -0.0671 0.005 -12.297 0.000 -0.078 -0.056
MonthlyCharges 0.0302 0.002 17.585 0.000 0.027 0.034
TotalCharges 0.0001 6.14e-05 2.361 0.018 2.47e-05 0.000
In [ ]:
# - predicting the probabilities
probabilities = binomial_linear_model.predict()
probabilities[:10]
Out[ ]:
array([0.31861382, 0.13162129, 0.47723817, 0.04417324, 0.60445587,
       0.72962176, 0.47459352, 0.2095354 , 0.53216748, 0.02770232])
In [ ]:
# - predicting binary labels, taking \sigma = 0.5
predictions = (probabilities > .5).astype('int')
predictions[:10]
Out[ ]:
array([0, 0, 0, 0, 1, 1, 0, 0, 1, 0])
In [ ]:
# - observed vs. predicted labels

predictions_df = pd.DataFrame()

predictions_df['observation'] = model_frame['Churn']
predictions_df['prediction'] = predictions

predictions_df.head()
Out[ ]:
observation prediction
0 0 0
1 0 0
2 1 0
3 0 0
4 1 1

The ROC analysis¶

True Positives (Hits): data say 1 and prediction is 1

In [ ]:
# - model True Positive Rate (TPR, Hit rate)
tpr = (predictions_df['observation'] == 1)&(predictions_df['prediction'] == 1)
tpr = np.sum(tpr)/np.sum(predictions_df['observation'] == 1)
np.round(tpr, 4)
Out[ ]:
0.4425

False Positives (False Alarms): data say 0 and prediction is 1

In [ ]:
# - model False Positive Rate (FPR, False Alarm rate)
fpr = (predictions_df['observation'] == 0)&(predictions_df['prediction'] == 1)
fpr = np.sum(fpr)/np.sum(predictions_df['observation'] == 0)
np.round(fpr, 4)
Out[ ]:
0.091

True Negatives (Correct Rejections): data say 0 and prediction is 0

In [ ]:
# - model True Negative Rate (TNR, Correct Rejection rate)
tnr = (predictions_df['observation'] == 0)&(predictions_df['prediction'] == 0)
tnr = np.sum(tnr)/np.sum(predictions_df['observation'] == 0)
np.round(tnr, 4)
Out[ ]:
0.909

False Negatives (Misses): data say 1 and prediction is 0

In [ ]:
# - model False Negative Rate (FNR, Miss rate)
fnr = (predictions_df['observation'] == 1)&(predictions_df['prediction'] == 0)
fnr = np.sum(fnr)/np.sum(predictions_df['observation'] == 1)
np.round(fnr, 4)
Out[ ]:
0.5575

True Positive Rate + False Negative Rate (or Hit + Miss)

In [ ]:
tpr + fnr
Out[ ]:
1.0

False Positive Rate + True Negative Rate (or False Alarm + Correct Rejection)

In [ ]:
fpr + tnr
Out[ ]:
1.0
In [ ]:
### --- Constructing the ROC curve of the model

# - calculating model metrics for varying decision criteria \sigma
dec_criterion = np.arange(.01, .99, step=.01)

observations = model_frame['Churn']

hits = []
fas = []
accuracies = []

for x in dec_criterion:
    
    predictions =  probabilities > x
    
    accuracy = np.sum(observations == predictions)/len(observations)
    hit = np.sum((observations == 1)&(predictions == 1))/np.sum(observations == 1)
    fa = np.sum((observations == 0)&(predictions == 1))/np.sum(observations == 0)
    
    accuracies.append(accuracy)
    fas.append(fa)
    hits.append(hit)
In [ ]:
roc_frame = pd.DataFrame()

roc_frame['hit'] = hits
roc_frame['fa'] = fas
roc_frame['accuracy'] = accuracies
roc_frame['dec'] = dec_criterion

roc_frame.head()
Out[ ]:
hit fa accuracy dec
0 0.998930 0.935890 0.312571 0.01
1 0.993579 0.889793 0.344994 0.02
2 0.989834 0.847182 0.375284 0.03
3 0.987694 0.809413 0.402446 0.04
4 0.979668 0.771063 0.428470 0.05
In [ ]:
# - difference between the hit and false alarm rates
roc_frame['diff'] = roc_frame['hit'] - roc_frame['fa']
roc_frame.head()
Out[ ]:
hit fa accuracy dec diff
0 0.998930 0.935890 0.312571 0.01 0.063040
1 0.993579 0.889793 0.344994 0.02 0.103787
2 0.989834 0.847182 0.375284 0.03 0.142652
3 0.987694 0.809413 0.402446 0.04 0.178281
4 0.979668 0.771063 0.428470 0.05 0.208605
In [ ]:
# - identifying the entry with the highest hit-false alarm rate difference
diff_argmax = roc_frame['diff'].argmax()
print(f'The optimal model is:\n{roc_frame.iloc[diff_argmax]}')
The optimal model is:
hit         0.784912
fa          0.316483
accuracy    0.710466
dec         0.250000
diff        0.468429
Name: 24, dtype: float64
In [ ]:
# - plotting the ROC curve of the model
# - the point with the biggest hit-false alarm rate difference is marked
sns.lineplot(data=roc_frame, x='dec', y='dec', color='black')
sns.lineplot(data=roc_frame, x='fa', y='hit')

plt.scatter(x=roc_frame.loc[diff_argmax, 'fa'], y=roc_frame.loc[diff_argmax, 'hit'], c='r')
plt.text(x=roc_frame.loc[diff_argmax, 'fa']-.08, y=roc_frame.loc[diff_argmax, 'hit'], s='Here!')

sns.despine(offset=10, trim=True)
plt.xlabel('False Positive (False Alarm) Rate', fontsize=14)
plt.ylabel('True Positive (Hit) Rate', fontsize=14)
plt.title('ROC Analysis for the Binomial Regression Model', fontsize=16);

The Akaike Information Criterion (AIC)¶

The Akaike Information Criterion (AIC) is a statistical measure used to evaluate the goodness-of-fit of a model. It is based on the principle of parsimony, which states that simpler models should be preferred over more complex ones, all else being equal.

The AIC is defined as follows:

$$AIC = -2\ln(\mathcal{L}) + 2k $$

where $\mathcal{L}$ is the model likelihood and $k$ is the number of parameters in the model.

The AIC penalizes models with more parameters, by adding a penalty term $2k$ to the log-likelihood $-2\ln(\mathcal{L})$. This penalty term is larger for models with more parameters, and hence it discourages overfitting and encourages simpler models.

The AIC can be used to compare different models and select the best one based on their AIC values. The model with the lowest AIC value is preferred, as it strikes a good balance between goodness-of-fit and simplicity.
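As a quick illustration of this trade-off, consider two competing models; the log-likelihood values below are hypothetical toy numbers, not taken from our churn model:

```python
# toy illustration of the AIC trade-off; log-likelihoods are made up
loglike_simple, k_simple = -3200.0, 4      # e.g. 3 predictors + intercept
loglike_complex, k_complex = -3195.0, 11   # better fit, more parameters

aic_simple = -2*loglike_simple + 2*k_simple      # 6408.0
aic_complex = -2*loglike_complex + 2*k_complex   # 6412.0

# the complex model fits better, yet the simpler one has the lower AIC:
# a 5-point gain in log-likelihood does not justify 7 extra parameters
```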

In [ ]:
# - Akaike Information Criterion (AIC)
binomial_linear_model.aic
Out[ ]:
6384.226174634843
In [ ]:
# - another way to compute AIC
# - note: k must count ALL estimated parameters, including the intercept
model_loglike = binomial_linear_model.llf
k = len(predictors) + 1
aic = -2*model_loglike + 2*k
aic
Out[ ]:
6384.226174634843

Model effect: comparison to the Null Model¶

In Binomial Logistic Regression, the null model is a model with no predictor variables, meaning that it only includes an intercept term. The null model predicts the probability of the outcome variable based on its overall frequency in the sample, without taking any other variables into account.
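A minimal sketch of what the null model does, on hypothetical toy outcomes (not our churn data): the predicted probability is the overall frequency of the outcome, and each observation contributes $\ln(p)$ or $\ln(1-p)$ to the log-likelihood.

```python
import numpy as np

# hypothetical outcomes: 30% positive
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

# the null model predicts the overall frequency for every observation
p_null = y.mean()   # 0.3

# null-model log-likelihood: log(p_null) where y == 1, log(1 - p_null) where y == 0
ll_null = np.sum(y*np.log(p_null) + (1 - y)*np.log(1 - p_null))
```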

Model Log-Likelihood

In [ ]:
# - Log-likelihood of model
model_loglike = binomial_linear_model.llf
model_loglike
Out[ ]:
-3188.1130873174216

Null Model Log-Likelihood

In [ ]:
# Value of the constant-only loglikelihood
null_loglike = binomial_linear_model.llnull
null_loglike
Out[ ]:
-4071.6775733255813
In [ ]:
# - Comparison to the Null Model which follows the Chi-Square distribution
# - Likelihood ratio chi-squared statistic; G = -2*(llnull - llf)
dev_diff = binomial_linear_model.llr
dev_diff
Out[ ]:
1767.1289720163195

The Likelihood ratio Chi-Squared statistic (sometimes termed: G) follows the $\chi^2$ distribution.
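To make the test explicit, the p-value can be computed from the $\chi^2$ survival function; this sketch assumes SciPy is available (it is not imported in the setup cell above):

```python
from scipy.stats import chi2

# likelihood ratio statistic and its degrees of freedom
# (df = number of predictors added over the intercept-only null model)
G = 1767.1289720163195
df = 3

p_value = chi2.sf(G, df)   # survival function: P(X >= G) under the null
# a G this large is so far in the tail that the p-value underflows to 0.0
```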

In [ ]:
print(f'G: {-2*(null_loglike - model_loglike)}')
print(binomial_linear_model.llr_pvalue)
G: 1767.1289720163195
0.0

Target: Predict churn from all the predictors¶

In [ ]:
# - exponential of the parameters and AIC of the model using only numerical predictors (a reminder)
np.exp(binomial_linear_model.params)
Out[ ]:
Intercept         0.202133
tenure            0.935088
MonthlyCharges    1.030660
TotalCharges      1.000145
dtype: float64
In [ ]:
binomial_linear_model.aic
Out[ ]:
6384.226174634843
In [ ]:
### --- Preparing the dataset

# - dropping the 'customerID' column
model_frame = churn_data.drop(columns='customerID')
model_frame.head()
Out[ ]:
tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 1 No Month-to-month Yes Electronic check 29.85 29.85 No
1 34 Yes One year No Mailed check 56.95 1889.50 No
2 2 Yes Month-to-month Yes Mailed check 53.85 108.15 Yes
3 45 No One year No Bank transfer (automatic) 42.30 1840.75 No
4 2 Yes Month-to-month Yes Electronic check 70.70 151.65 Yes
In [ ]:
model_frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   tenure            7032 non-null   int64  
 1   PhoneService      7032 non-null   object 
 2   Contract          7032 non-null   object 
 3   PaperlessBilling  7032 non-null   object 
 4   PaymentMethod     7032 non-null   object 
 5   MonthlyCharges    7032 non-null   float64
 6   TotalCharges      7032 non-null   float64
 7   Churn             7032 non-null   object 
dtypes: float64(2), int64(1), object(5)
memory usage: 439.6+ KB
In [ ]:
# - encoding 'Churn' column to binary values
model_frame['Churn'] = model_frame['Churn'].apply(lambda x: int(x == 'Yes'))
model_frame.head()
Out[ ]:
tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 1 No Month-to-month Yes Electronic check 29.85 29.85 0
1 34 Yes One year No Mailed check 56.95 1889.50 0
2 2 Yes Month-to-month Yes Mailed check 53.85 108.15 1
3 45 No One year No Bank transfer (automatic) 42.30 1840.75 0
4 2 Yes Month-to-month Yes Electronic check 70.70 151.65 1
In [ ]:
predictors = model_frame.columns.drop('Churn')
predictors
Out[ ]:
Index(['tenure', 'PhoneService', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges'],
      dtype='object')
In [ ]:
# --- Composing the formula of the model

# - right side of the formula
formula = ' + '.join(predictors)

# - left side of the formula
formula = 'Churn ~ ' + formula

formula
Out[ ]:
'Churn ~ tenure + PhoneService + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges'
In [ ]:
# - fitting BLR model to the data
binomial_linear_model = smf.logit(formula=formula, data=model_frame).fit()
Optimization terminated successfully.
         Current function value: 0.424667
         Iterations 8
In [ ]:
binomial_linear_model.summary()
Out[ ]:
Logit Regression Results
Dep. Variable: Churn No. Observations: 7032
Model: Logit Df Residuals: 7021
Method: MLE Df Model: 10
Date: Sat, 06 May 2023 Pseudo R-squ.: 0.2666
Time: 00:09:35 Log-Likelihood: -2986.3
converged: True LL-Null: -4071.7
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
Intercept -0.8446 0.176 -4.794 0.000 -1.190 -0.499
PhoneService[T.Yes] -0.8419 0.122 -6.926 0.000 -1.080 -0.604
Contract[T.One year] -0.9171 0.103 -8.876 0.000 -1.120 -0.715
Contract[T.Two year] -1.8135 0.172 -10.565 0.000 -2.150 -1.477
PaperlessBilling[T.Yes] 0.4230 0.073 5.812 0.000 0.280 0.566
PaymentMethod[T.Credit card (automatic)] -0.1006 0.112 -0.895 0.371 -0.321 0.120
PaymentMethod[T.Electronic check] 0.4126 0.093 4.453 0.000 0.231 0.594
PaymentMethod[T.Mailed check] -0.1194 0.113 -1.061 0.289 -0.340 0.101
tenure -0.0606 0.006 -9.932 0.000 -0.073 -0.049
MonthlyCharges 0.0225 0.002 11.022 0.000 0.019 0.027
TotalCharges 0.0003 6.81e-05 4.205 0.000 0.000 0.000
In [ ]:
# - exponentials of the new model parameters
np.exp(binomial_linear_model.params)
Out[ ]:
Intercept                                   0.429725
PhoneService[T.Yes]                         0.430907
Contract[T.One year]                        0.399695
Contract[T.Two year]                        0.163084
PaperlessBilling[T.Yes]                     1.526520
PaymentMethod[T.Credit card (automatic)]    0.904265
PaymentMethod[T.Electronic check]           1.510691
PaymentMethod[T.Mailed check]               0.887475
tenure                                      0.941170
MonthlyCharges                              1.022802
TotalCharges                                1.000286
dtype: float64
In [ ]:
# - AIC of the new model
binomial_linear_model.aic
Out[ ]:
5994.512151127206
In [ ]:
### --- Constructing the ROC curve of the model

# - calculating model metrics for varying decision criteria \sigma
dec_criterion = np.arange(.01, .99, step=.01)

probabilities = binomial_linear_model.predict()
observations = model_frame['Churn']

hits = []
fas = []
accuracies = []

for x in dec_criterion:
    
    predictions =  probabilities > x
    
    accuracy = np.sum(observations == predictions)/len(observations)
    hit = np.sum((observations == 1)&(predictions == 1))/np.sum(observations == 1)
    fa = np.sum((observations == 0)&(predictions == 1))/np.sum(observations == 0)
    
    accuracies.append(accuracy)
    fas.append(fa)
    hits.append(hit)
In [ ]:
roc_frame = pd.DataFrame()

roc_frame['hit'] = hits
roc_frame['fa'] = fas
roc_frame['accuracy'] = accuracies
roc_frame['dec'] = dec_criterion
roc_frame['diff'] = roc_frame['hit'] - roc_frame['fa']

roc_frame.head()
Out[ ]:
hit fa accuracy dec diff
0 0.996255 0.872942 0.358077 0.01 0.123313
1 0.990904 0.791788 0.416240 0.02 0.199117
2 0.986624 0.731551 0.459329 0.03 0.255072
3 0.983414 0.688166 0.490330 0.04 0.295248
4 0.979668 0.653883 0.514505 0.05 0.325785
In [ ]:
# - identifying the entry with the highest hit-false alarm rate difference
diff_argmax = roc_frame['diff'].argmax()
print(f'The optimal model is:\n{roc_frame.iloc[diff_argmax]}')
The optimal model is:
hit         0.838416
fa          0.318420
accuracy    0.723265
dec         0.240000
diff        0.519997
Name: 23, dtype: float64
In [ ]:
# - plotting the ROC curve of the model
# - the point with the biggest hit-false alarm rate difference is marked
sns.lineplot(data=roc_frame, x='dec', y='dec', color='black')
sns.lineplot(data=roc_frame, x='fa', y='hit')

plt.scatter(x=roc_frame.loc[diff_argmax, 'fa'], y=roc_frame.loc[diff_argmax, 'hit'], c='r')
plt.text(x=roc_frame.loc[diff_argmax, 'fa']-.08, y=roc_frame.loc[diff_argmax, 'hit'], s='Here!')

sns.despine(offset=10, trim=True)
plt.xlabel('False Positive (False Alarm) Rate', fontsize=14)
plt.ylabel('True Positive (Hit) Rate', fontsize=14)
plt.title('ROC Analysis for the Binomial Regression Model', fontsize=16);

ROC Analysis Elaborated¶

  • The $TPR$ (True-Positive Rate, or Hit rate) is also known as Sensitivity, Recall, or Probability of Detection. It is the probability that the classifier correctly recognizes the target class, i.e. $P(Predict=C_T|Observation=C_T)$.
  • The $FPR$ (False-Positive Rate, or False Alarm rate) is the probability that the classifier incorrectly predicts the target class when the other class is actually present, i.e. $P(Predict=C_T|Observation=\overline{C_T})$.
  • The $TNR$ (True-Negative Rate, or Correct Rejection rate), a.k.a. Specificity or Selectivity, is the probability that the classifier correctly predicts the absence of the target class when the other class is actually present, i.e. $P(Predict=\overline{C_T}|Observation=\overline{C_T})$.
  • The $FNR$ (False-Negative Rate, or Miss rate) is the probability that the classifier incorrectly predicts the absence of the target class when the target class is actually present, i.e. $P(Predict=\overline{C_T}|Observation=C_T)$.

The ROC plot can thus also be understood as a $1-Specificity$ vs $Sensitivity$ plot (because $FPR = 1 - Specificity$). In mathematical statistics, a Type I Error indicates a situation in which our statistical model predicts something which is not the case in the population (that is your $\alpha$ level in statistical analyses): this concept maps exactly onto our $FPR$ or False Alarm rate. On the other hand, Statistical Power is the probability that a statistical model (or test) successfully recovers an occurrence in the population: and this matches our understanding of $TPR$ or Hit rate. Thus, the ROC also plots the Type I Error against Statistical Power.
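All four rates can be read directly off a confusion matrix. A small sketch with hypothetical counts (not our churn model):

```python
# hypothetical confusion-matrix counts
tn, fp = 45, 5    # observed 0: correct rejections and false alarms
fn, tp = 10, 40   # observed 1: misses and hits

tpr = tp / (tp + fn)   # Sensitivity / Recall  = 0.8
fpr = fp / (fp + tn)   # 1 - Specificity       = 0.1
tnr = tn / (tn + fp)   # Specificity           = 0.9
fnr = fn / (fn + tp)   # Miss rate             = 0.2

# the complementary pairs always sum to one
assert abs(tpr + fnr - 1.0) < 1e-12 and abs(fpr + tnr - 1.0) < 1e-12
```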

Precision (or Positive Predictive Value (PPV)) and False Discovery Rate (FDR)

$$PPV = \frac{TP}{TP+FP}=1-FDR$$

The classifier's Precision is the ratio of True Positives to the sum of True Positives and False Positives: the ratio of correct ("relevant") classifications to the number of positive classifications made.

$$FDR = \frac{FP}{FP+TP}=1-PPV$$

The classifier's False Discovery Rate is the ratio of False Positives to the sum of True Positives and False Positives: the ratio of incorrect ("irrelevant") classifications to the number of positive classifications made.

Accuracy and Balanced Accuracy

In cases of highly imbalanced classes in binary classification, Accuracy can give us a dangerously misleading result:

$$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$$

Balanced Accuracy ($bACC$) can be used to correct for class imbalance:

$$bACC=\frac{TPR+TNR}{2}$$

To understand how $bACC$ works, think of the following case: we have a dataset with 100 observations, of which 75 are of class $C_1$ and 25 of class $C_2$; hence, a model that always predicts $C_1$ and never $C_2$ must be accurate 75% of the time, right? However, its $bACC$ is only .5 (because its $TPR=1$ and $TNR=0$).
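The 75/25 example above can be verified in a few lines:

```python
import numpy as np

# 100 observations, 75 of class 1 and 25 of class 0,
# and a degenerate model that always predicts class 1
y_true = np.array([1]*75 + [0]*25)
y_pred = np.ones(100, dtype=int)

accuracy = np.mean(y_true == y_pred)                               # 0.75
tpr = np.sum((y_true == 1) & (y_pred == 1)) / np.sum(y_true == 1)  # 1.0
tnr = np.sum((y_true == 0) & (y_pred == 0)) / np.sum(y_true == 0)  # 0.0
bacc = (tpr + tnr) / 2                                             # 0.5
```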

The $F_1$ score

This is traditionally used to assess how good a classifier is:

$$F_1 = 2\frac{Precision\cdot Recall}{Precision+Recall}$$

and is also known as F-measure or balanced F-score.

Note.

  • Precision is important to use when minimizing false positives is the goal: maximizing precision will minimize the number of false positives;
  • Recall is important to use when minimizing false negatives is the goal: maximizing the recall will minimize the number of false negatives:
$$Precision = \frac{TP}{TP+FP}$$

$$Recall = \frac{TP}{TP+FN}$$

We can thus use a more general $F$-score, $F_\beta$, where $\beta$ determines how many times recall is weighted as heavily as precision:

$$F_\beta = (1+\beta^2)\frac{Precision\cdot Recall}{\beta^2 \cdot Precision+Recall}$$
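A small sketch of how $\beta$ shifts the balance; the precision and recall values below are hypothetical:

```python
def f_beta(precision, recall, beta=1.0):
    # general F-score: beta > 1 weights recall more heavily,
    # beta < 1 weights precision more heavily
    b2 = beta**2
    return (1 + b2) * precision * recall / (b2*precision + recall)

# hypothetical precision/recall values
p, r = 0.725, 0.838
f1  = f_beta(p, r)         # beta = 1 reduces to the usual F1 score
f2  = f_beta(p, r, 2.0)    # recall counted four times as important
f05 = f_beta(p, r, 0.5)    # precision weighted more heavily
```

Since recall exceeds precision here, the score grows as $\beta$ grows.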

Area Under the Curve (AUC)

This is probably the most frequently used indicator of model fit in classification. Given that the ideal classifier sits in the top left corner of the ROC plot, i.e. where $TPR=1$ and $FPR=0$, the best model in a comparison is the one with the greatest area under the ROC curve.
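The area can be approximated with the trapezoidal rule over the $(FPR, TPR)$ points, assuming they are sorted by increasing false-alarm rate; the points below are hypothetical:

```python
import numpy as np

# hypothetical ROC points, sorted by increasing false-alarm rate
fpr = np.array([0.0, 0.1, 0.3, 0.6, 1.0])
tpr = np.array([0.0, 0.5, 0.8, 0.95, 1.0])

# trapezoidal rule: sum of (base width) * (mean height) per segment
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
```

In practice one would typically rely on `sklearn.metrics.roc_auc_score`, which computes this directly from the observations and the predicted probabilities.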

In [ ]:
# - ROC Analysis continued

# - identifying the entry with the highest hit-false alarm rate difference
diff_argmax = roc_frame['diff'].argmax()
optimal_model = roc_frame.iloc[diff_argmax]
print(f'The optimal model is:\n{optimal_model}')
The optimal model is:
hit         0.838416
fa          0.318420
accuracy    0.723265
dec         0.240000
diff        0.519997
Name: 23, dtype: float64

Balanced Accuracy to correct for class imbalance: $bACC=\frac{TPR+TNR}{2}$

In [ ]:
#bACC
tpr = optimal_model['hit']
tnr = 1-optimal_model['fa']
bACC = (tpr+tnr)/2
print(f'The model bACC: {bACC:.3f}')
The model bACC: 0.760

Precision (or Positive Predictive Value (PPV)): $PPV = \frac{TP}{TP+FP}=1-FDR$

In [ ]:
# Precision = TP/(TP+FP); the hit and false alarm rates alone are not
# enough, because TPR and FPR are normalized by different class counts
n_pos = np.sum(observations == 1)
n_neg = np.sum(observations == 0)
tp = optimal_model['hit']*n_pos
fp = optimal_model['fa']*n_neg
precision = tp/(tp + fp)
print(f'The model Precision: {precision:.3f}')
The model Precision: 0.488

The model Recall is simply the Hit rate (TPR):

In [ ]:
# Recall
recall = optimal_model['hit']
print(f'The model Recall: {recall:.3f}')
The model Recall: 0.838

The $F_1$ score

$$F_1 = 2\frac{Precision\cdot Recall}{Precision+Recall}$$
In [ ]:
# F1 Score
f1 = 2*((precision*recall)/(precision+recall))
print(f'The model F1 score: {f1:.3f}')
The model F1 score: 0.617

2. BLR using scikit-learn¶

sklearn: Predicting churn from numerical predictors¶

In [ ]:
# - import scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

Predicting with numerical variables only:

In [ ]:
### --- Preparing the variables 

# - feature matrix
X = churn_data[['tenure', 'MonthlyCharges', 'TotalCharges']].values

# - target variable
y = churn_data['Churn'].apply(lambda x: int(x == 'Yes'))
In [ ]:
## --- Fitting the logistic model to the numerical data
log_reg = LogisticRegression()
log_reg.fit(X, y)
Out[ ]:
LogisticRegression()
In [ ]:
# - coefficients of the model
log_reg.coef_, log_reg.intercept_
Out[ ]:
(array([[-0.06711264,  0.03019993,  0.00014507]]), array([-1.59884834]))
In [ ]:
# - model's accuracy
round(log_reg.score(X, y), 4)
Out[ ]:
0.785
In [ ]:
# - confusion matrix for the given data
y_pred = log_reg.predict(X)
confusion_matrix(y, y_pred)
Out[ ]:
array([[4693,  470],
       [1042,  827]])

sklearn: Predicting churn from all the predictors¶

In [ ]:
churn_data.head()
Out[ ]:
customerID tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG 1 No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE 34 Yes One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK 2 Yes Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW 45 No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU 2 Yes Month-to-month Yes Electronic check 70.70 151.65 Yes
In [ ]:
### --- Composing the feature matrix

# - dropping all the non-numerical and non-binary categorical columns
X0 = churn_data.drop(columns=['customerID', 'Contract', 'PaymentMethod', 'Churn'])

# - encoding binary categorical features to binary values
X0['PaperlessBilling'] = X0['PaperlessBilling'].apply(lambda x: int(x == 'Yes'))
X0['PhoneService'] = X0['PhoneService'].apply(lambda x: int(x == 'Yes'))

X0.head()
Out[ ]:
tenure PhoneService PaperlessBilling MonthlyCharges TotalCharges
0 1 0 1 29.85 29.85
1 34 1 0 56.95 1889.50
2 2 1 1 53.85 108.15
3 45 0 0 42.30 1840.75
4 2 1 1 70.70 151.65
In [ ]:
# - casting the data frame into a matrix
X0 = X0.values
X0
Out[ ]:
array([[1.0000e+00, 0.0000e+00, 1.0000e+00, 2.9850e+01, 2.9850e+01],
       [3.4000e+01, 1.0000e+00, 0.0000e+00, 5.6950e+01, 1.8895e+03],
       [2.0000e+00, 1.0000e+00, 1.0000e+00, 5.3850e+01, 1.0815e+02],
       ...,
       [1.1000e+01, 0.0000e+00, 1.0000e+00, 2.9600e+01, 3.4645e+02],
       [4.0000e+00, 1.0000e+00, 1.0000e+00, 7.4400e+01, 3.0660e+02],
       [6.6000e+01, 1.0000e+00, 1.0000e+00, 1.0565e+02, 6.8445e+03]])
In [ ]:
# - categories of the 'Contract' variable
churn_data['Contract'].unique()
Out[ ]:
array(['Month-to-month', 'One year', 'Two year'], dtype=object)
In [ ]:
# - categories of the 'PaymentMethod' variable
churn_data['PaymentMethod'].unique()
Out[ ]:
array(['Electronic check', 'Mailed check', 'Bank transfer (automatic)',
       'Credit card (automatic)'], dtype=object)
In [ ]:
# - we want to recreate the previous statsmodels model that was using all the predictors
# - to achieve this we one-hot (dummy) encode non-binary categorical predictors
# - statsmodels (via patsy) sorts the category levels alphabetically and uses the first one as the reference category
# - we pass the reference category manually as an argument to the OneHotEncoder

# - note: in scikit-learn >= 1.2 the 'sparse' argument is renamed 'sparse_output'
enc_contract = OneHotEncoder(drop=['Month-to-month'], sparse=False)
dummy_contract = enc_contract.fit_transform(churn_data['Contract'].values.reshape(-1, 1))

enc_payment = OneHotEncoder(drop=['Bank transfer (automatic)'], sparse=False)
dummy_payment = enc_payment.fit_transform(churn_data['PaymentMethod'].values.reshape(-1, 1))
In [ ]:
# - concatenating values of the numerical predictors and encoded binary values with the encoded non-binary values
# - into a feature matrix
X = np.concatenate((X0, dummy_contract, dummy_payment), axis=-1)
display(X)

# - target variable; encoding to binary values
y = churn_data['Churn'].apply(lambda x: int(x == 'Yes'))
array([[ 1.,  0.,  1., ...,  0.,  1.,  0.],
       [34.,  1.,  0., ...,  0.,  0.,  1.],
       [ 2.,  1.,  1., ...,  0.,  0.,  1.],
       ...,
       [11.,  0.,  1., ...,  0.,  1.,  0.],
       [ 4.,  1.,  1., ...,  0.,  0.,  1.],
       [66.,  1.,  1., ...,  0.,  0.,  0.]])
In [ ]:
### --- Fitting the logistic model to all the data
# - penalty='none' disables regularization; in scikit-learn >= 1.2 use penalty=None
log_reg = LogisticRegression(solver='newton-cg', penalty='none')
log_reg.fit(X, y)
Out[ ]:
LogisticRegression(penalty='none', solver='newton-cg')
In [ ]:
# - model's accuracy
round(log_reg.score(X, y), 4)
Out[ ]:
0.7964
In [ ]:
# - exponential of the model parameters
# - ordering corresponds to the ordering of the features in the feature matrix
np.exp(log_reg.coef_), np.exp(log_reg.intercept_)
Out[ ]:
(array([[0.9411695 , 0.43090694, 1.52652001, 1.02280179, 1.00028639,
         0.39969523, 0.16308373, 0.90426487, 1.51069102, 0.88747499]]),
 array([0.42972514]))
In [ ]:
# - confusion matrix for the given data
y_pred = log_reg.predict(X)
confusion_matrix(y, y_pred)
Out[ ]:
array([[4615,  548],
       [ 884,  985]])

3. MLE for Binomial Logistic Regression¶

Say we have observed the following data: $HHTHTTHHHT$. Assume that we know the parameter $p_H$. We can compute the Likelihood function $\mathcal{L}(p_H|HHTHTTHHHT)$ exactly as we did before. Now, this is the general form of the Binomial Likelihood (where $Y$ stands for the observed data):

$$\mathcal{L}(p|Y) = p_1^y(1-p_1)^{n-y}$$

where $y$ is the number of successes and $n$ the total number of observations. For each observed data point then we have

$$\mathcal{L}(p|y_i) = p_1^{y_i}(1-p_1)^{\bar{y_i}}$$

where ${y_i}$ is the observed value of the outcome, $Y$, and $\bar{y_i}$ is its complement (e.g. $1$ for $0$ and $0$ for $1$). This form just determines which value will be used in the computation of the Likelihood function at each observed data point $y_i$: it will be either $p_1$ or $1-p_1$. The likelihood function for a given value of $p_1$ for the whole dataset is computed by multiplying the values of $\mathcal{L}(p|y_i)$ across the whole dataset (remember that multiplication in Probability is what conjunction is in Logic and Algebra).
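To make this concrete, here is a small illustrative sketch (not part of the churn analysis) that computes the likelihood of the coin-flip sequence above for several candidate values of $p_H$:

```python
import numpy as np

# - HHTHTTHHHT encoded as 1 for H (heads) and 0 for T (tails): 6 heads, 4 tails
flips = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])

def likelihood(p, data):
    # - multiply p for every head and (1 - p) for every tail
    return np.prod(np.where(data == 1, p, 1 - p))

for p in [0.4, 0.5, 0.6, 0.7]:
    print(f"p_H = {p}: L = {likelihood(p, flips):.6f}")
```

Among these candidates, the likelihood peaks at $p_H = 0.6$, the observed proportion of heads, which is exactly the maximum likelihood estimate for this sequence.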

Q: But... how do we get to $p_1$, the parameter value that we will use at each data point?

A: We will search the parameter space of linear coefficients $\beta_0, \beta_1, ... \beta_k$ in our Binomial Logistic Model, computing $p_1$ each time, and compute the likelihood function from it! In other words: we will search the parameter space to find the combination of $\beta_0, \beta_1, ... \beta_k$ that produces the maximum of the likelihood function, just as we searched the space of linear coefficients to find the combination that minimizes the squared error in Simple Linear Regression.

So what combination of the linear coefficients is the best one?

It is the one which gives the Maximum Likelihood. This approach, known as Maximum Likelihood Estimation (MLE), stands behind many important statistical learning models and is the cornerstone of Statistical Estimation Theory. It contrasts with the Least Squares Estimation that we used earlier to estimate Simple and Multiple Linear Regression models.

Now, there is a technical problem with this approach. To obtain the likelihood of the whole dataset, one needs to multiply as many very small numbers as there are data points (because $p_1 < 1$, and often $p_1 \ll 1$). The product can underflow below the smallest positive real number that a digital computer can represent. The workaround is to use the logarithm of the likelihood instead, known as the Log-Likelihood ($LL$).

Thus, while the Likelihood function for the whole dataset would be

$$\mathcal{L}(p|Y) = \prod_{i=1}^{n}p_1^{y_i}(1-p_1)^{\bar{y_i}}$$

the Log-Likelihood function would be:

$$LL(p|Y) = \sum_{i=1}^{n} y_i \log(p_1) + \bar{y_i} \log(1-p_1)$$
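The numerical motivation for $LL$ can be demonstrated directly with a small sketch (arbitrary values, not taken from the churn data): multiplying many probabilities underflows double precision, while summing their logarithms does not:

```python
import numpy as np

p = 0.01          # an arbitrary small probability
n = 10_000        # an arbitrary number of "observations"

# - the product of 10,000 copies of 0.01 is 1e-20000: far below the
#   smallest positive double (~5e-324), so it underflows to exactly 0.0
product = np.prod(np.full(n, p))
print(product)

# - the sum of logs is a perfectly ordinary number
log_sum = np.sum(np.log(np.full(n, p)))
print(log_sum)    # n * log(0.01), approximately -46051.7
```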

And finally here is how we solve the Binomial Logistic Regression problem:

  • search through the parameter space spanned by the linear coefficients $\beta_0, \beta_1, ... \beta_k$,
  • predict $p_1$ from the model and a particular combination of the parameters,
  • compute the value of the Likelihood function for the whole dataset,
  • find the combination that yields the maximum of the Likelihood function.

Technically, in optimization we do not go for the maximum of the Likelihood function itself: we work with $LL$ instead of $\mathcal{L}(p|Y)$, and since numerical optimizers conventionally minimize, the solution is to minimize the negative $LL$, written simply as $NLL$, the Negative Log-Likelihood function.

In [ ]:
# - .copy() avoids pandas' SettingWithCopyWarning on the next assignment
model_frame = churn_data[['Churn', 'MonthlyCharges', 'TotalCharges']].copy()
# - encoding 'Churn' values to binary values
model_frame['Churn'] = model_frame['Churn'].apply(lambda x: int(x == 'Yes'))
model_frame
Out[ ]:
Churn MonthlyCharges TotalCharges
0 0 29.85 29.85
1 0 56.95 1889.50
2 1 53.85 108.15
3 0 42.30 1840.75
4 1 70.70 151.65
... ... ... ...
7027 0 84.80 1990.50
7028 0 103.20 7362.90
7029 0 29.60 346.45
7030 1 74.40 306.60
7031 0 105.65 6844.50

7032 rows × 3 columns

Implement the model predictive pass given the parameters; the following blr_predict() function is nothing more than an implementation of the following expression:

$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}}$$
In [ ]:
def blr_predict(params, data):
    # - grab parameters
    beta_0, beta_1, beta_2 = params
    # - predictors
    x1 = data["MonthlyCharges"]
    x2 = data["TotalCharges"]
    # - linear combination
    lc = beta_0 + beta_1*x1 + beta_2*x2
    # - inverse logit (sigmoid)
    p = 1/(1 + np.exp(-lc))
    return p

Test blr_predict()

In [ ]:
test_params = np.array([-2.7989, .0452, -.0006])
predictions = blr_predict(params=test_params, data=model_frame)
predictions
Out[ ]:
0       0.187309
1       0.204491
2       0.394181
3       0.120110
4       0.575848
          ...   
7027    0.460025
7028    0.072292
7029    0.158578
7030    0.593878
7031    0.106194
Length: 7032, dtype: float64

Now define the Negative Log-Likelihood function:

In [ ]:
from scipy.stats import binom

def blr_nll(params, data):
    # - predictive pass
    p = blr_predict(params, data)
    # - joint negative log-likelihood,
    #   summed across all observations
    nll = -binom.logpmf(data["Churn"], 1, p).sum()
    return nll

Test blr_nll():

In [ ]:
# - test blr_nll()
test_params = np.array([-2.7989, .0452, -.0006])
blr_nll(params=test_params, data=model_frame)
Out[ ]:
3284.57719221028

Optimize!

In [ ]:
from scipy.optimize import minimize

# - initial (random) parameter values
init_beta_0 = np.random.uniform(low=-3, high=0)
init_beta_1 = np.random.uniform(low=-.05, high=.05)
init_beta_2 = np.random.uniform(low=-.001, high=.001)
# - x0 must be a flat, 1-D array of parameters
init_pars = np.array([init_beta_0, init_beta_1, init_beta_2])

# - optimize w. Nelder-Mead
optimal_model = minimize(
    # - fun(parameters, *args)
    fun=blr_nll,
    args=(model_frame,),
    x0=init_pars,
    method='Nelder-Mead',
    options={'maxiter': 1e6,
             'maxfev': 1e6,
             'fatol': 1e-6})

# - optimal parameters
for param in optimal_model.x:
    print(param)
-2.798897478214532
0.04522958083546827
-0.0006181190783916775
In [ ]:
optimal_model.fun
Out[ ]:
3283.393916678725

Check against statsmodels

In [ ]:
predictors = model_frame.columns.drop('Churn')
formula = ' + '.join(predictors)
formula = 'Churn ~ ' + formula
print(formula)
binomial_linear_model = smf.logit(formula=formula, data=model_frame).fit()
binomial_linear_model.summary()
Churn ~ MonthlyCharges + TotalCharges
Optimization terminated successfully.
         Current function value: 0.466922
         Iterations 6
Out[ ]:
Logit Regression Results
Dep. Variable: Churn No. Observations: 7032
Model: Logit Df Residuals: 7029
Method: MLE Df Model: 2
Date: Sat, 06 May 2023 Pseudo R-squ.: 0.1936
Time: 00:09:40 Log-Likelihood: -3283.4
converged: True LL-Null: -4071.7
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
Intercept -2.7989 0.089 -31.548 0.000 -2.973 -2.625
MonthlyCharges 0.0452 0.001 31.620 0.000 0.042 0.048
TotalCharges -0.0006 1.97e-05 -31.373 0.000 -0.001 -0.001

Plot the Model Log-Likelihood Function

In [ ]:
# - from statsmodels: beta_0 is -2.7989, beta_1 is .0452, and beta_2 is -.0006
beta_0 = -2.7989
beta_1_vals = np.linspace(-.05,.05,100)
beta_2_vals = np.linspace(-.001,.001,100)
grid = np.array([(beta_1, beta_2) for beta_1 in beta_1_vals for beta_2 in beta_2_vals])
grid = pd.DataFrame(grid)
grid = grid.rename(columns={0: "beta_1", 1: "beta_2"})
nll = []
for i in range(grid.shape[0]):
    pars = [beta_0, grid['beta_1'][i], grid['beta_2'][i]]
    # - note the minus sign: we store the log-likelihood, i.e. the negated NLL
    l = -blr_nll(pars, model_frame)
    nll.append(l)
grid['nll'] = nll
grid.sort_values('nll', ascending=False, inplace=True)
grid.head(100)
grid.head(100)
Out[ ]:
beta_1 beta_2 nll
9419 0.044949 -0.000616 -3283.545964
9518 0.045960 -0.000636 -3283.967556
9420 0.044949 -0.000596 -3284.466526
9519 0.045960 -0.000616 -3285.218180
9320 0.043939 -0.000596 -3285.320667
... ... ... ...
9917 0.050000 -0.000657 -3324.443332
9027 0.040909 -0.000455 -3324.735600
9315 0.043939 -0.000697 -3324.857073
9523 0.045960 -0.000535 -3325.140288
9118 0.041919 -0.000636 -3325.271093

100 rows × 3 columns

In [ ]:
# - import plotly
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"

# - Mesh3d: Objective Function
fig = go.Figure(data=[go.Mesh3d(
    x=grid['beta_1'], 
    y=grid['beta_2'], 
    z=grid['nll'], 
    color='red', 
    opacity=0.50)])
fig.update_layout(scene = dict(
                    xaxis_title='Beta_1',
                    yaxis_title='Beta_2',
                    zaxis_title='LL'),
                    width=700,
                    margin=dict(r=20, b=10, l=10, t=10))
fig.show()

4. Regularization of BLR¶

We will now use scikit-learn to fit regularized BLR models directly.

In [ ]:
### --- Preparing the variables 

# - feature matrix
X = churn_data[['tenure', 'MonthlyCharges', 'TotalCharges']].values

# - target variable
y = churn_data['Churn'].apply(lambda x: int(x == 'Yes'))

It is recommended to standardize the feature matrix $X$ before using regularized logistic regression in scikit-learn, particularly if the features have different scales. Regularization techniques like $L1$ and $L2$ regularization are sensitive to the scale of the features, and features with larger scales may dominate the regularization penalty. Standardizing the features can help to mitigate this issue and ensure that all features contribute equally to the model.

To standardize the features in scikit-learn, we use the StandardScaler transformer from the sklearn.preprocessing module:

In [ ]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)

LASSO (L1) Regularization¶

Please make sure to study the LogisticRegression() documentation from Scikit-Learn: sklearn.linear_model.LogisticRegression:

C float, default=1.0

Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

In [ ]:
# Create an instance of the logistic regression model with L1 regularization
logistic_model = LogisticRegression(penalty='l1',
                                    C = .15, 
                                    solver='liblinear')

# Fit the model on the training data
logistic_model.fit(X, y)

# Print the model coefficients (the intercept and the coefficients of the predictors)
print("Model coefficients:", logistic_model.intercept_, logistic_model.coef_)
Model coefficients: [-1.43661609] [[-1.42078062  0.94437135  0.10327116]]

Examine a range of values for the inverse regularization strength C:

In [ ]:
# define the range of C (inverse penalty) values to try
C_range = [0.001, 0.01, 0.1, 1, 10, 100]

# create empty list to store the score for each C value
scores = []

# iterate over the C values and calculate the training-set accuracy for each
for C in C_range:
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear').fit(X, y)
    sc = model.score(X, y)
    scores.append(sc)

    # print the training-set accuracy for this C value
    print(f"C = {C:.3f}: Score = {sc:.3f}")

# find the best C value based on the training-set accuracy
best_C = C_range[scores.index(max(scores))]
print(f"Best C = {best_C:.3f}")
C = 0.001: Score = 0.734
C = 0.010: Score = 0.789
C = 0.100: Score = 0.785
C = 1.000: Score = 0.785
C = 10.000: Score = 0.785
C = 100.000: Score = 0.785
Best C = 0.010

Ridge (L2) Regularization¶

In [ ]:
# Create an instance of the logistic regression model with L2 regularization
logistic_model = LogisticRegression(penalty='l2',
                                    C = .15, 
                                    solver='liblinear')

# Fit the model on the training data
logistic_model.fit(X, y)

# Print the model coefficients (the intercept and the coefficients of the predictors)
print("Model coefficients:", logistic_model.intercept_, logistic_model.coef_)
Model coefficients: [-1.4361844] [[-1.4546272   0.929324    0.14909584]]

Examine a range of values for the inverse regularization strength C:

In [ ]:
# define the range of C (inverse penalty) values to try
C_range = [0.001, 0.01, 0.1, 1, 10, 100]

# create empty list to store the score for each C value
scores = []

# iterate over the C values and calculate the training-set accuracy for each
for C in C_range:
    model = LogisticRegression(penalty='l2', C=C, solver='liblinear').fit(X, y)
    sc = model.score(X, y)
    scores.append(sc)

    # print the training-set accuracy for this C value
    print(f"C = {C:.3f}: Score = {sc:.3f}")

# find the best C value based on the training-set accuracy
best_C = C_range[scores.index(max(scores))]
print(f"Best C = {best_C:.3f}")
C = 0.001: Score = 0.789
C = 0.010: Score = 0.787
C = 0.100: Score = 0.785
C = 1.000: Score = 0.785
C = 10.000: Score = 0.785
C = 100.000: Score = 0.785
Best C = 0.001

ElasticNet Regularization¶

l1_ratio float, default=None

The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Only used if penalty='elasticnet'. Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to using penalty='l1'. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
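As a quick sanity check of this equivalence, the following sketch (synthetic data, not the churn dataset; a tightened tol so both fits converge to the same optimum) confirms that l1_ratio=1.0 under penalty='elasticnet' reproduces a pure L1 fit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# - a small synthetic binary classification problem
Xs, ys = make_classification(n_samples=500, n_features=5, random_state=42)

# - elasticnet with l1_ratio=1.0 ...
m_en = LogisticRegression(penalty='elasticnet', l1_ratio=1.0, C=0.5,
                          solver='saga', tol=1e-8, max_iter=100000).fit(Xs, ys)
# - ... should match a pure L1 fit with the same solver and C
m_l1 = LogisticRegression(penalty='l1', C=0.5,
                          solver='saga', tol=1e-8, max_iter=100000).fit(Xs, ys)

print(np.allclose(m_en.coef_, m_l1.coef_, atol=1e-4))
```

The same check with l1_ratio=0.0 against penalty='l2' works analogously, since both settings optimize the identical convex objective.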

In [ ]:
# Create an instance of the logistic regression model with ElasticNet regularization
logistic_model = LogisticRegression(penalty='elasticnet',
                                    l1_ratio = .5,
                                    C = .15,                                 
                                    solver='saga')

# Fit the model on the training data
logistic_model.fit(X, y)

# Print the model coefficients (the intercept and the coefficients of the predictors)
print("Model coefficients:", logistic_model.intercept_, logistic_model.coef_)
Model coefficients: [-1.45022017] [[-1.46181079  0.93680138  0.1456478 ]]
In [ ]:
# The same ElasticNet model, now with a larger max_iter
# (the saga solver may need more iterations to converge)
logistic_model = LogisticRegression(penalty='elasticnet',
                                    l1_ratio = .5,
                                    C = .15, 
                                    max_iter=10000,                                
                                    solver='saga')

# Fit the model on the training data
logistic_model.fit(X, y)

# Print the model coefficients (the intercept and the coefficients of the predictors)
print("Model coefficients:", logistic_model.intercept_, logistic_model.coef_)
Model coefficients: [-1.45020376] [[-1.46189613  0.93674334  0.14572617]]

Now, use the F1 score to evaluate the model:

In [ ]:
from sklearn.metrics import f1_score

# define the range of C (inverse penalty) values to try
C_range = [0.001, 0.01, 0.1, 1, 10, 100]

# define the range of l1_ratio (ElasticNet mix parameter) values to try
l1_range = [0, .25, .5, .75, 1]

# create empty lists to store the results for each (C, l1_ratio) combination
scores = []
cs = []
l1_ratios = []
f1s = []

# iterate over the (C, l1_ratio) grid and calculate the training-set accuracy for each
for C in C_range:
    for l1 in l1_range:
        model = LogisticRegression(penalty='elasticnet', 
                                   C = C, 
                                   l1_ratio = l1,
                                   max_iter = 10000,
                                   solver='saga').fit(X,y)
        sc = model.score(X, y)
        scores.append(sc)
        # store C, l1_ratio
        cs.append(C)
        l1_ratios.append(l1)
        # predict the outcome on the training data
        y_pred = model.predict(X)
        # calculate the F1 score
        f1 = f1_score(y, y_pred)
        f1s.append(f1)
        # print accuracy and F1 for the current C and l1_ratio values
        print(f"C = {C:.3f}, l1_ratio = {l1:.3f}: Accuracy = {sc:.7f}, F1 = {f1:.7f}")

# find the best C, l1_ratio values based on F1
best_ix = f1s.index(max(f1s))
best_C = cs[best_ix]
print(f"Best C = {best_C:.3f}")
best_l1_ratio = l1_ratios[best_ix]
print(f"Best l1_ratio = {best_l1_ratio:.3f}")
best_f1 = f1s[best_ix]
print(f"Best F1 = {best_f1:.7f}")
best_acc = scores[best_ix]
print(f"Best Accuracy = {best_acc:.7f}")
C = 0.001, l1_ratio = 0.000: Accuracy = 0.7417520, F1 = 0.0677618
C = 0.001, l1_ratio = 0.250: Accuracy = 0.7342150, F1 = 0.0000000
C = 0.001, l1_ratio = 0.500: Accuracy = 0.7342150, F1 = 0.0000000
C = 0.001, l1_ratio = 0.750: Accuracy = 0.7342150, F1 = 0.0000000
C = 0.001, l1_ratio = 1.000: Accuracy = 0.7342150, F1 = 0.0000000
C = 0.010, l1_ratio = 0.000: Accuracy = 0.7889647, F1 = 0.5089345
C = 0.010, l1_ratio = 0.250: Accuracy = 0.7891069, F1 = 0.5061605
C = 0.010, l1_ratio = 0.500: Accuracy = 0.7901024, F1 = 0.5050302
C = 0.010, l1_ratio = 0.750: Accuracy = 0.7899602, F1 = 0.5045287
C = 0.010, l1_ratio = 1.000: Accuracy = 0.7902446, F1 = 0.5051996
C = 0.100, l1_ratio = 0.000: Accuracy = 0.7851251, F1 = 0.5204697
C = 0.100, l1_ratio = 0.250: Accuracy = 0.7852673, F1 = 0.5209391
C = 0.100, l1_ratio = 0.500: Accuracy = 0.7854096, F1 = 0.5217116
C = 0.100, l1_ratio = 0.750: Accuracy = 0.7854096, F1 = 0.5223172
C = 0.100, l1_ratio = 1.000: Accuracy = 0.7852673, F1 = 0.5221519
C = 1.000, l1_ratio = 0.000: Accuracy = 0.7846985, F1 = 0.5214918
C = 1.000, l1_ratio = 0.250: Accuracy = 0.7848407, F1 = 0.5219589
C = 1.000, l1_ratio = 0.500: Accuracy = 0.7848407, F1 = 0.5219589
C = 1.000, l1_ratio = 0.750: Accuracy = 0.7848407, F1 = 0.5219589
C = 1.000, l1_ratio = 1.000: Accuracy = 0.7848407, F1 = 0.5219589
C = 10.000, l1_ratio = 0.000: Accuracy = 0.7848407, F1 = 0.5222608
C = 10.000, l1_ratio = 0.250: Accuracy = 0.7848407, F1 = 0.5222608
C = 10.000, l1_ratio = 0.500: Accuracy = 0.7848407, F1 = 0.5222608
C = 10.000, l1_ratio = 0.750: Accuracy = 0.7848407, F1 = 0.5222608
C = 10.000, l1_ratio = 1.000: Accuracy = 0.7848407, F1 = 0.5222608
C = 100.000, l1_ratio = 0.000: Accuracy = 0.7849829, F1 = 0.5227273
C = 100.000, l1_ratio = 0.250: Accuracy = 0.7849829, F1 = 0.5224258
C = 100.000, l1_ratio = 0.500: Accuracy = 0.7848407, F1 = 0.5222608
C = 100.000, l1_ratio = 0.750: Accuracy = 0.7848407, F1 = 0.5222608
C = 100.000, l1_ratio = 1.000: Accuracy = 0.7849829, F1 = 0.5227273
Best C = 100.000
Best l1_ratio = 0.000
Best F1 = 0.5227273
Best Accuracy = 0.7849829

NOTE the following:

In [ ]:
y.value_counts()
Out[ ]:
0    5163
1    1869
Name: Churn, dtype: int64

Oops: a class imbalance problem? Use the class_weight argument of LogisticRegression():

class_weight dict or ‘balanced’, default=None

Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

In [ ]:
n_samples = X.shape[0]
n_classes = 2
print(np.bincount(y))
n_samples/(n_classes * np.bincount(y))
[5163 1869]
Out[ ]:
array([0.68099942, 1.8812199 ])
In [ ]:
from sklearn.metrics import f1_score

# define the range of C (inverse penalty) values to try
C_range = [0.001, 0.01, 0.1, 1, 10, 100]

# define the range of l1_ratio (ElasticNet mix parameter) values to try
l1_range = [0, .25, .5, .75, 1]

# create empty lists to store the results for each (C, l1_ratio) combination
scores = []
cs = []
l1_ratios = []
f1s = []

# iterate over the (C, l1_ratio) grid and calculate the training-set accuracy for each
for C in C_range:
    for l1 in l1_range:
        model = LogisticRegression(penalty='elasticnet', 
                                   C = C, 
                                   l1_ratio = l1,
                                   max_iter = 10000,
                                   class_weight = 'balanced',
                                   solver='saga').fit(X,y)
        sc = model.score(X, y)
        scores.append(sc)
        # store C, l1_ratio
        cs.append(C)
        l1_ratios.append(l1)
        # predict the outcome on the training data
        y_pred = model.predict(X)
        # calculate the F1 score
        f1 = f1_score(y, y_pred)
        f1s.append(f1)
        # print accuracy and F1 for the current C and l1_ratio values
        print(f"C = {C:.3f}, l1_ratio = {l1:.3f}: Accuracy = {sc:.7f}, F1 = {f1:.7f}")

# find the best C, l1_ratio values based on F1
best_ix = f1s.index(max(f1s))
best_C = cs[best_ix]
print(f"Best C = {best_C:.3f}")
best_l1_ratio = l1_ratios[best_ix]
print(f"Best l1_ratio = {best_l1_ratio:.3f}")
best_f1 = f1s[best_ix]
print(f"Best F1 = {best_f1:.7f}")
best_acc = scores[best_ix]
print(f"Best Accuracy = {best_acc:.7f}")
C = 0.001, l1_ratio = 0.000: Accuracy = 0.7197099, F1 = 0.5909940
C = 0.001, l1_ratio = 0.250: Accuracy = 0.7040671, F1 = 0.5830495
C = 0.001, l1_ratio = 0.500: Accuracy = 0.6868601, F1 = 0.5689115
C = 0.001, l1_ratio = 0.750: Accuracy = 0.6616894, F1 = 0.5455587
C = 0.001, l1_ratio = 1.000: Accuracy = 0.6396473, F1 = 0.5217063
C = 0.010, l1_ratio = 0.000: Accuracy = 0.7251138, F1 = 0.5840327
C = 0.010, l1_ratio = 0.250: Accuracy = 0.7238339, F1 = 0.5839760
C = 0.010, l1_ratio = 0.500: Accuracy = 0.7245449, F1 = 0.5881352
C = 0.010, l1_ratio = 0.750: Accuracy = 0.7245449, F1 = 0.5883103
C = 0.010, l1_ratio = 1.000: Accuracy = 0.7238339, F1 = 0.5875106
C = 0.100, l1_ratio = 0.000: Accuracy = 0.7236917, F1 = 0.5854491
C = 0.100, l1_ratio = 0.250: Accuracy = 0.7238339, F1 = 0.5853971
C = 0.100, l1_ratio = 0.500: Accuracy = 0.7238339, F1 = 0.5848653
C = 0.100, l1_ratio = 0.750: Accuracy = 0.7235495, F1 = 0.5837259
C = 0.100, l1_ratio = 1.000: Accuracy = 0.7235495, F1 = 0.5828326
C = 1.000, l1_ratio = 0.000: Accuracy = 0.7246871, F1 = 0.5875586
C = 1.000, l1_ratio = 0.250: Accuracy = 0.7245449, F1 = 0.5872576
C = 1.000, l1_ratio = 0.500: Accuracy = 0.7246871, F1 = 0.5875586
C = 1.000, l1_ratio = 0.750: Accuracy = 0.7245449, F1 = 0.5872576
C = 1.000, l1_ratio = 1.000: Accuracy = 0.7245449, F1 = 0.5872576
C = 10.000, l1_ratio = 0.000: Accuracy = 0.7239761, F1 = 0.5867575
C = 10.000, l1_ratio = 0.250: Accuracy = 0.7239761, F1 = 0.5867575
C = 10.000, l1_ratio = 0.500: Accuracy = 0.7241183, F1 = 0.5870583
C = 10.000, l1_ratio = 0.750: Accuracy = 0.7239761, F1 = 0.5867575
C = 10.000, l1_ratio = 1.000: Accuracy = 0.7239761, F1 = 0.5867575
C = 100.000, l1_ratio = 0.000: Accuracy = 0.7241183, F1 = 0.5870583
C = 100.000, l1_ratio = 0.250: Accuracy = 0.7241183, F1 = 0.5870583
C = 100.000, l1_ratio = 0.500: Accuracy = 0.7241183, F1 = 0.5870583
C = 100.000, l1_ratio = 0.750: Accuracy = 0.7241183, F1 = 0.5870583
C = 100.000, l1_ratio = 1.000: Accuracy = 0.7241183, F1 = 0.5870583
Best C = 0.001
Best l1_ratio = 0.000
Best F1 = 0.5909940
Best Accuracy = 0.7197099

Class imbalance can severely affect your model and its ROC analysis; in particular, accuracy does not make much sense as an indicator of model fit when class imbalance is present in the data.
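The point is easy to quantify with the class counts from y.value_counts() above: a degenerate "model" that always predicts the majority class (no churn) already achieves about 73% accuracy while its F1 score is exactly zero (a quick sketch):

```python
# - class counts taken from y.value_counts() above
n_negative, n_positive = 5163, 1869
n_total = n_negative + n_positive

# - accuracy of always predicting the majority class (0 = no churn)
baseline_accuracy = n_negative / n_total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")

# - this baseline never predicts the positive class, so its recall is 0
#   and therefore F1 = 0: any useful churn model must both beat ~0.734
#   accuracy and achieve a non-trivial F1 score
```

In this light, the fitted models above should be judged primarily by how much their F1 scores improve over this trivial baseline, not by raw accuracy.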

Further Reading¶

  • Scikit-Learn sklearn.linear_model.LogisticRegression documentation
  • A Gentle Introduction to Threshold-Moving for Imbalanced Classification, by Jason Brownlee on February 10, 2020
  • Optimization: Loss Function Under the Hood (Part I)
  • Loss Function (Part II): Logistic Regression

DataKolektiv, 2022/23.

hello@datakolektiv.com

License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.